2b or not 2b? 2bRAD is an effective alternative to ddRAD for phylogenomics

Abstract Restriction‐site‐associated DNA sequencing (RADseq) has become an accessible way to obtain genome‐wide data in the form of single‐nucleotide polymorphisms (SNPs) for phylogenetic inference. Nonetheless, how differences in RADseq methods influence phylogenetic estimation is poorly understood because most comparisons have largely relied on conceptual predictions rather than empirical tests. We examine how differences in ddRAD and 2bRAD data influence phylogenetic estimation in two non‐model frog groups. We compare the impact of method choice on phylogenetic information, missing data, and allelic dropout, considering different sequencing depths. Given that researchers must balance input (funding, time) with output (amount and quality of data), we also provide comparisons of laboratory effort, computational time, monetary costs, and the repeatability of library preparation and sequencing. Both 2bRAD and ddRAD methods estimated well‐supported trees, even at low sequencing depths, and had comparable amounts of missing data, patterns of allelic dropout, and phylogenetic signal. Compared to ddRAD, 2bRAD produced more repeatable datasets, had simpler laboratory protocols, and had an overall faster bioinformatics assembly. However, many fewer parsimony‐informative sites per SNP were obtained from 2bRAD data when using native pipelines, highlighting a need for further investigation into the effects of each pipeline on resulting datasets. Our study underscores the importance of comparing RADseq methods, such as expected results and theoretical performance using empirical datasets, before undertaking costly experiments.

FIGURE S2 Panels comparing the expected and observed character changes for each sampling depth, summarized for each combination of dataset and character state pattern. A goodness-of-fit test (chi-squared) was used to determine whether the distribution of number of changes for a dataset sample (blue bars) was different than expected at random (red bars). Asterisks indicate that the post-hoc comparison between observed and expected is significant at p < 0.0001, and n is the total observed changes for the given panel.

COST BREAKDOWN
Costs for all steps associated with our total dataset for 2bRAD and ddRAD protocols presented in this paper. Estimates for sequencing are from the pricing offered in 2017 at the UT Austin Genomic Sequencing and Analysis Facility; in-lab ddRAD costs are taken from Peterson et al.
Step 2bRAD (2012). p ddRAD sequencing was performed using the Illumina HiSeq 4000 technology with 150 bp single-end reads, which is able to sequence 240M reads/lane at a cost of $1,500/lane. This value was calculated using the average number of per-sample reads obtained for Rana and Epipedobates (average of 6.4M reads per sample). Thus, sequencing costs were (6.4*1,500)/240 = $40.00 per sample. For paired-end reads using the same size selection window and sequencing technology, the cost for 240M reads/lane is $2,500. Thus, sequencing costs for paired-end sequencing would be (6.4*2,500)/240 = $66.67 per sample. q This cost is for sequencing single-end reads; the total per sample cost for 150 bp paired-end reads is the laboratory costs ($68.48) in addition to sequencing costs ($66.67) = $135.15. r For ddRAD single-end reads, the complete iPyrad bioinformatics pipeline took >48 hours depending on sampling depth; for paired-end reads, we were unable to complete the bioinformatics analysis due time limits. Processing of both single-and paired-end reads requires high-powered cluster computers. s The ddRAD protocol requires experience to operate specialized equipment and perform protocols (e.g., experience using a Pippin Prep machine, performing magnetic bead cleanup using a SPRIPlate).

Whole genome sequencing costs
We also estimated the costs of performing whole genome sequencing (WGS) compared to our 2bRAD and ddRAD cost estimates. We estimated that in-house WGS library preparation costs are approximately $25 per sample. Sequencing costs are based on QB3 Genomics at the University of California Berkeley pricing; WGS costs $2.50 per GB genome size. For 10x coverage, WGS would cost $225 per sample in Epipedobates (estimated genome size: 9GB) and $175 per sample in Rana (estimated genome size: 6GB). Thus, combined library preparation and sequencing costs amount to $250 and $200 per sample (for Epipedobates and Rana, respectively).

ALLELIC DROPOUT AND PHYLOGENETIC SIGNAL: ADDITIONAL DETAILS
To derive the null expectation, one must consider the presence/absence of a locus across the ten tips of the tree. For example, if a locus is absent in four tips and present in six (0000111111), there are 210 different ways (permutations) to assign four 0s and six 1s to the ten uniquely labeled tips. For each of these permutations, we optimized the locus on the tree under Dollo parsimony and then calculated the number of evolutionary changes. The frequencies of each category of changes (1, 2, 3, ...) forms the null distribution of possible changes on the tree. In this example, the expected frequency is 2/210 for one change, 8/210 for two, 120/210 for three, and 80/210 for four. Thus, under the null only 0.95% of permutations (2/120) are expected to show perfect fit (one step), while 38.1% (80/210) are expected to show poor fit (four changes, maximum homoplasy).
For each dataset, the characters were divided into state-frequency classes (0011111111, 00011111111, ..., 0000000011) and the null expectation for each was generated by optimizing all unique permutations of states on the tree using the Permn function of the R package DescTools (Signorell 2020). Each permutation was optimized on the tree, and the number of changes was cumulated for each state-frequency class, similar to permutation tests for phylogenetic signal (Archie 1989).
For each dataset, the expected proportions for each state-frequency class were compared to the observed proportions for 1-5 changes, 5 being the maximum possible given the tree topologies. A chi-squared test was performed, with significance determined by 999 permutations; this yielded 28 chi-squared tests. Because the null hypothesis was (unsurprisingly) rejected at 0.001 in every case, we performed one-tailed post-hoc tests using the chi-squared standardized residuals (observed minus expected divided by the standard deviation of the expected) to determine whether the observed changes significantly exceeded the expected or not for each category (1-5 changes). A stringent critical value of 3.09 standard units (alpha=0.001, one-tailed test) was applied (scripts in Supporting information). For the total datasets we used R plots to compare the proportions of observed changes that exceeded the expected signal (i.e., significant signal) to those that did not.

DATA REPRODUCIBILITY
The following schematic details all files contained within Supplemental Information (available here: https://doi.org/10.5061/dryad.fbg79cnsp) required for each analysis. N.B.: In most of the raw data and output files, S. erasmios samples (RDT0158 and RDT0159) are mislabeled as S. nubicola. This was due to a misidentification of the individuals; none of our analyses were affected by this change. Files relevant for unambiguous change / binary recoded analyses 1-Permutation-Chisq-Analyses Code to run permutation (chi-square test) analysis on binary-recoded data. Contains scripts/and Results/ directories and an R project file (.Rproj). R project is used to generate Fig. S2